Abstract

Statistical machine translation models have made great progress in
improving translation quality. However, existing models predict the
target translation using only source- and target-side local context
information. In practice, distinguishing good translations from bad ones
depends not only on local features but also on global sentence-level
information. In this paper, we explore source-side global sentence-level
features for target-side local translation prediction. We propose a novel
bilingually-constrained chunk-based convolutional neural network to learn
sentence semantic representations. With the sentence-level feature
representation, we further design a feed-forward neural network to better
predict translations using both local and global information. Large-scale
experiments show that our method obtains substantial improvements in
translation quality over a strong baseline: the hierarchical phrase-based
translation model augmented with the neural network joint model.


1 Introduction 

In recent years, statistical machine translation (SMT) models, such as
phrase-based models [Koehn et al., 2007], hierarchical phrase-based
models [Chiang, 2007], and linguistically syntax-based models [Liu et
al., 2006; Galley et al., 2006], have made great progress in improving
translation performance. In these translation models, the target sentence
is generated by composing several local translations with reordering
models or synchronous grammars, and the local translations are produced
with the help of source- and target-side local context information. In
most cases, the translation of source language words can be determined
with local context features. However, there are many cases in which the
target translation depends not only on the local context but also on
global sentence-level information.

Take the Chinese sentence and its English translation in Figure 1 as an
example. For the Chinese word in red, molecule is the most probable
translation. Even with the help of local context information, a model
cannot work out the correct translation. Given the sentence-level
semantics, which concern preventing illegal immigrants and unsafe people
from entering Japan, we can be sure that the best target translation for
the Chinese word in red is people. Clearly, global sentence-level
semantic information plays an important role in local translation
prediction.

Source: [Chinese source sentence; characters not recoverable from the extraction]

Word Trans: this not-only prevent not safe of molecule , also is prevent those illegal immigrant enter japan .

Ref: this will not only prevent dangerous people , but also prevent those illegal immigrants from entering Japan.

Figure 1: An example of Chinese-to-English translation in which the
translation of the Chinese word in red needs the global sentence
information of the Chinese sentence.

Two questions arise: 1) how to represent the global sentence-level
semantics? 2) how to make full use of the sentence semantic
representation in statistical machine translation models?

For the first problem, neural network methods have recently been proposed
to learn sentence representations. These methods include recurrent neural
networks [Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014;
Bahdanau et al., 2014], recursive neural networks [Socher et al., 2011;
Socher et al., 2013a; Socher et al., 2013b], sentence to vector [Le and
Mikolov, 2014] and convolutional neural networks [Kim, 2014; Kalchbrenner
et al., 2014; Hu et al., 2014; Zeng et al., 2014]. It should be noted
that most of the above approaches learn distributed sentence
representations for specific tasks, such as classification, sequence
labelling, and structure prediction. The semantic meaning of the sentence
is not fully captured by the sentence representation. In this paper, we
focus on learning the sentence semantic representation. Although we have
no gold sentence semantic representation available for supervised
learning, we have a large number of parallel sentence pairs which share
the same semantic meaning. Accordingly, sentence translation equivalents
can supervise each other to learn their semantic representations.
Furthermore, we design the chunk-based convolutional neural network to
handle the variation in sentence length while retaining as much
information as possible. In this network, we only need to decide how many
chunks a sentence should be segmented into. Combining the two ideas, we
propose the bilingually-constrained chunk-based convolutional neural
network (BCCNN) for sentence semantic representation.

Figure 2: Chunk-based convolutional neural network architecture for modelling sentence representation.

For the second problem, we incorporate the sentence semantic
representation during the decoding process to better generate the target
translation. Following the idea in [Devlin et al., 2014], we design a
feed-forward neural network which takes the learnt sentence semantic
representation as a new input feature to predict the conditional
probability of the target word given both the local and global
information. As an informative feature, this conditional probability is
integrated into the log-linear model of the hierarchical phrase-based
translation system [Chiang, 2007].

In this paper, we make the following contributions: 


• Our bilingually-constrained method circumvents the lack of gold
labelled data and provides a good way to learn sentence semantic
representations.

• To deal with the variable length of sentences while retaining as much
semantics as possible, we propose the chunk-based convolutional neural
network, in which we can choose the number of chunks.

• When integrating the sentence semantic representation into the decoding
process, we achieve significant improvements in translation quality over
a strong baseline.


2 Sentence Semantic Representation 

The convolutional neural network (CNN), consisting of convolution and
pooling layers, provides a standard architecture [Collobert et al., 2011]
which maps variable-length sentences into fixed-length distributed
vectors. This section starts by introducing a new variant of the standard
CNN, called chunk-based CNN, which is designed to keep more of the
semantics of the sentence.


2.1 Chunk-based Convolutional Neural Network 

The model architecture is illustrated in Figure 2. The data flow is
similar to the standard CNN: the model takes as input the sequence of
word embeddings in a sentence, summarizes the sentence meaning by
convolving a sliding window over the sentence and pooling the salient
features, and yields a fixed-length distributed vector through further
layers, such as a dropout layer, additional convolution and pooling
layers, and linear and non-linear layers.


Specifically, assume we are equipped with a word embedding matrix
$L \in \mathbb{R}^{k \times |V|}$ (see footnote 1) trained on large-scale
monolingual data using an unsupervised algorithm (e.g. word2vec [Mikolov
et al., 2013]). Given a sentence $w_1 w_2 \cdots w_n$, each word $w_i$ is
first projected into a vector $X_i$ through the word embedding matrix.
Then, we concatenate all the vectors in order to form the input of the
model:

$$X = [X_1, X_2, \cdots, X_n] \quad (1)$$


Convolution Layer involves a number of filters $W \in \mathbb{R}^{h \times k}$
which summarize the information of an $h$-word window and produce a new
feature. For the window of $h$ words $X_{i:i+h-1}$, the $l$-th filter
$F_l$ ($1 \le l \le L$, where $L$ denotes the number of filters)
generates the feature $y_i^l$ as follows:

$$y_i^l = \sigma(W \cdot X_{i:i+h-1} + b) \quad (2)$$

where $\sigma$ is a non-linear activation function (e.g. Relu or Sigmoid)
and $b$ is a bias term. When the filter traverses each window in the
sentence from $X_{1:h}$ to $X_{n-h+1:n}$, we get the output of the
feature map corresponding to filter $F_l$:

$$y^l = [y_1^l, y_2^l, \cdots, y_{n-h+1}^l] \quad (3)$$

Here, $y^l \in \mathbb{R}^{n-h+1}$. It should be noted that sentences
differ from each other in length $n$ (from several words to more than 100
words), so $y^l$ has different dimensions for different sentences. The
key question is how to transform the variable-length vector $y^l$ into a
fixed-length vector.
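To make Equations 2 and 3 concrete, the following is a minimal NumPy sketch of a single filter sliding over a sentence. It assumes the input is stored as an n x k matrix (one embedding per row) rather than a flat concatenation, that the product in Equation 2 is the element-wise sum over the window, and that the activation is Relu; the function name `feature_map` and the toy sizes are illustrative only.

```python
import numpy as np

def feature_map(X, W, b):
    """Slide one filter W (h x k) over X (n x k); return y of length n - h + 1."""
    n, k = X.shape
    h = W.shape[0]
    if n < h:
        raise ValueError("sentence shorter than the filter window")
    y = np.empty(n - h + 1)
    for i in range(n - h + 1):
        window = X[i:i + h]                               # rows i .. i+h-1
        y[i] = np.maximum(0.0, np.sum(W * window) + b)    # Relu(W . X + b)
    return y

rng = np.random.default_rng(0)
k, h = 4, 3                      # toy embedding size and window width
W, b = rng.normal(size=(h, k)), 0.1
short = feature_map(rng.normal(size=(5, k)), W, b)
long_ = feature_map(rng.normal(size=(12, k)), W, b)
print(short.shape, long_.shape)  # (3,) (10,)
```

The two outputs have different lengths, which is exactly the variable-length problem that the pooling layer has to solve.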

Pooling Layer is designed to perform this task. In most cases, a standard
max-over-time pooling operation [Collobert et al., 2011; Kim, 2014] is
applied over $y^l$, and the maximum value $\hat{y}^l = \max\{y^l\}$ is
chosen as the most important feature of the filter $F_l$. This idea is
simple and easy to implement. However, it has two disadvantages: 1) most
of the information in the sentence is lost, and 2) the word order
information is missing. Kalchbrenner et al. [2014] proposed to take the
top-$k$ maximum values over $y^l$ to keep more information, but the word
order information is still missing. Hu et al. [2014] designed a
max-pooling over every two units, but they require fixed-length inputs.
In this paper, we design a simple but effective chunk-based max-pooling
that retains more of the semantics of the sentence and also keeps the
order information.

Footnote 1: k is the embedding dimension and |V| is the vocabulary size.

Figure 3: Bilingually-constrained chunk-based convolutional neural network architecture for learning sentence semantic representation.


Given the predefined number of chunks $C$ (e.g. $C = 4$ in Figure 2), we
first divide $y^l$ evenly into $C$ segments, and then take the maximum
value from each segment. Note that $n - h + 1$ does not have to be
divisible by $C$, and the last chunk can have the size of the modulus.
Then, we can transform the variable-length vector $y^l$ into a
fixed-length vector $\hat{y}^l$ with $C$ values:

$$\hat{y}^l = \mathrm{chunkMax}\{y^l\} = \mathrm{chunkMax}\{[y_1^l, y_2^l, \cdots, y_{n-h+1}^l]\} = [y_{c_1}^l, y_{c_2}^l, \cdots, y_{c_C}^l] \quad (4)$$

If there are $L$ filters, the output of the pooling layer will be a
vector of $L \times C$ dimensions. In our model, the pooling layer is
followed by a dropout layer and two fully connected linear layers with
the non-linear activation Relu. Finally, we obtain a fixed-length output
vector for each sentence.
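The chunk-based max-pooling of Equation 4 can be sketched in a few lines of NumPy. This is one possible reading of the description, not the authors' code: chunks have size ceil(len(y) / C) so that the last chunk holds the remainder, and, as an extra assumption not stated in the text, a chunk that ends up empty for a very short sentence contributes 0.

```python
import numpy as np

def chunk_max(y, num_chunks):
    """Split y into num_chunks consecutive segments and keep each segment's maximum."""
    y = np.asarray(y, dtype=float)
    size = -(-len(y) // num_chunks)        # ceil(len(y) / num_chunks)
    pooled = np.zeros(num_chunks)
    for c in range(num_chunks):
        segment = y[c * size:(c + 1) * size]
        if segment.size:                   # assumption: empty chunks stay at 0
            pooled[c] = segment.max()
    return pooled

def pool_all_filters(feature_maps, num_chunks):
    """Concatenate the pooled values of L feature maps into one L * C vector."""
    return np.concatenate([chunk_max(y, num_chunks) for y in feature_maps])

y = [1, 5, 2, 7, 0, 3, 4, 4, 9, 6]
print(chunk_max(y, 4))   # [5. 7. 9. 6.]
print(chunk_max(y, 1))   # [9.]
```

With a single chunk the function reduces to max-over-time pooling, while larger values of `num_chunks` keep one maximum per region and therefore preserve coarse order information.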

The idea of the chunk-based CNN is inspired by the inherent structure of
a sentence. From the perspective of shallow structure, a sentence is
organized as subject-verb-object (the word order of both English and
Chinese). We can therefore assign C a small number (e.g. 2 or 3) to
summarize this kind of semantic information. From the perspective of deep
structure, a sentence can be described as a sequence of noun phrases
(NP), verb phrases (VP), adjective phrases (ADJP), prepositional phrases
(PP) and so on. Accordingly, we can set C to a relatively large number to
capture these kinds of information. The chunk-based CNN can therefore
retain more of the semantics of the sentence and also maintain the
sentence structure to some extent.


2.2 Bilingually-constrained Chunk-based CNN 


The convolutional neural network is usually tuned to optimize an
objective function for a specific task, such as sentiment and relation
classification [Kim, 2014; Zeng et al., 2014]. The resulting sentence
representation is class sensitive but does not encode adequate semantic
meaning of the sentence. Since our goal is to learn sentence semantic
representations, we need to find a well-defined objective function.

However, there are no gold sentence semantic representations available in
the real world. Fortunately, we know that if two sentences share the same
meaning, their semantic representations should be identical. In machine
translation, there are many parallel sentence pairs for different
language pairs, such as Chinese-English and Arabic-English. We can
therefore infer that if a model can learn the same representation for any
parallel sentence pair sharing the same meaning, the learnt
representation must encode the semantics of the sentences, and the
corresponding model is the one we desire. Inspired by the work on word
and phrase embeddings using a bilingual method [Zou et al., 2013; Zhang
et al., 2014a], we propose the Bilingually-constrained Chunk-based CNN
(BCCNN), whose basic goal is to minimize the semantic distance between
the sentences and their translations.

As illustrated in Figure 3, given a source language sentence $f$ and its
translation $e$, the chunk-based CNN generates the fixed-length output
vectors $O_f$ and $O_e$ respectively. Then, $O_f$ and $O_e$ are projected
into a shared semantic space and become $O'_f$ and $O'_e$ (e.g.
$O'_f = \sigma(W^f \cdot O_f + b^f)$, where $W^f$ denotes the
transformation matrix). In the shared space, the distance between two
representation vectors can be calculated easily with a dot-product. The
basic objective function is to minimize the distance
$dis(f, e; \theta) = dis(O'_f, O'_e)$ between $O_f$ and $O_e$.

We know that a good model should not only make the representations of
translation equivalents as similar as possible, but also make the
representations of non-translation pairs as different as possible.
Therefore, our objective function is also designed to maximize the
distance $dis(f, e^*; \theta)$ if $(f, e^*)$ is a non-translation pair.
We then design our objective function to be a max-margin loss:

$$j(f, e, e^*; \theta) = \max(0, 1 + dis(f, e^*; \theta) - dis(f, e; \theta)) \quad (5)$$

Here, $\theta$ includes all the parameters of the bilingually-constrained
chunk-based CNN. Note that Equation 5, as printed, reaches zero once
$dis(f, e; \theta)$ exceeds $dis(f, e^*; \theta)$ by a margin of 1, which
is consistent with reading the dot-product value as a similarity score
(larger means closer) rather than as a geometric distance. For any
translation equivalent $(f, e)$, we can randomly choose a sentence $e^*$
($e^* \ne e$) from the target language monolingual data and obtain the
non-translation pair $(f, e^*)$. The final objective function over the
large-scale parallel sentence pairs $(F, E)$ of size $N$ is:

$$J(F, E; \theta) = \frac{1}{N} \sum_{(f, e) \in (F, E)} j(f, e, e^*; \theta) + \frac{\lambda}{2} \|\theta\|^2 \quad (6)$$
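The training objective can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes the projection uses a sigmoid activation, follows Equation 5 exactly as it appears in the source (so the dot product behaves as a score that should be higher for true translation pairs), and takes the regularization term to be (lambda / 2) times the squared L2 norm of the parameters. All function names are illustrative.

```python
import numpy as np

def project(o, W, b):
    """Map a CNN output vector into the shared space (sigmoid is an assumption)."""
    return 1.0 / (1.0 + np.exp(-(W @ o + b)))

def dis(u, v):
    """Dot-product value between two vectors in the shared space."""
    return float(np.dot(u, v))

def pair_loss(f, e, e_neg):
    """Equation 5 as printed: max(0, 1 + dis(f, e*) - dis(f, e))."""
    return max(0.0, 1.0 + dis(f, e_neg) - dis(f, e))

def objective(triples, params, lam):
    """Equation 6: mean pair loss plus (lam / 2) * squared L2 norm of the parameters."""
    data_term = float(np.mean([pair_loss(f, e, e_neg) for f, e, e_neg in triples]))
    reg_term = 0.5 * lam * sum(float(np.sum(p ** 2)) for p in params)
    return data_term + reg_term

f = np.array([1.0, 0.0])
e = np.array([2.0, 0.0])        # stands in for the true translation
e_neg = np.array([0.0, 1.0])    # stands in for a randomly drawn sentence
print(pair_loss(f, e, e_neg))   # 0.0
print(pair_loss(f, e_neg, e))   # 3.0
```

In the first call the true pair already scores more than 1 above the random pair, so the loss vanishes; swapping the roles produces a positive loss that gradient descent would reduce.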

3 Integrating Sentence Semantic Representations in Translation Models

With the trained chunk-based convolutional neural network, we can obtain
the semantic representation for any sentence. In this section, we
introduce how to make full use of the sentence semantic representations
in statistical machine translation models.


3.1 Sentence Representation for Translation Probability Estimation

Formally, given a source sentence $s$, machine translation aims to find,
in the search space $T$, the best target translation hypothesis $t$,
which has the highest conditional probability $p(t \mid s)$. If we focus
on each target word, the conditional probability can be decomposed as
follows:

$$p(t \mid s) = \prod_{i=1}^{|t|} p(t_i \mid t_1, t_2, \cdots, t_{i-1}, s) \quad (7)$$

Devlin et al. [2014] approximated the single target word probability
$p(t_i \mid t_1, t_2, \cdots, t_{i-1}, s)$ by following the target n-gram
language model and using the source-side local context, which is called
the joint model:

$$p(t \mid s) \approx \prod_{i=1}^{|t|} p(t_i \mid t_{i-n+1}, \cdots, t_{i-1}, S_i) \quad (8)$$

Here $S_i$ includes the source-side local context associated with the
current target word $t_i$. Machine translation models, such as the
hierarchical phrase-based model, generate the target hypothesis with
translation rules (see footnote 2), from which we can find the source
word $s_{a_i}$ that is aligned to $t_i$ ($s_{a_i}$ and $t_i$ are
translations of each other). Devlin et al. [2014] take the $m$-word
source-side local context centered at $s_{a_i}$:

$$S_i = s_{a_i - \frac{m-1}{2}}, \cdots, s_{a_i}, \cdots, s_{a_i + \frac{m-1}{2}} \quad (9)$$

However, as discussed in the Introduction, besides the local context, the
global sentence semantic information plays an indispensable role in
accurate translation prediction. Therefore, we augment Equation 8 with
the global sentence semantics:

$$p(t \mid s) \approx \prod_{i=1}^{|t|} p(t_i \mid t_{i-n+1}, \cdots, t_{i-1}, S_i, s) \quad (10)$$

In this way, the translation of every target word during decoding is
aware of the source-side global sentence-level information.

In our experiments, following [Devlin et al., 2014], we use n = 4 and
m = 11. Data sparsity would clearly become a serious problem if we
employed traditional count-based methods to perform the probability
estimation. Therefore, we resort to neural networks, which perform the
probability estimation in a distributed continuous space.

The neural network architecture shown in Figure 4 is almost identical to
the original feed-forward neural network described in [Bengio et al.,
2003; Vaswani et al., 2013; Devlin et al., 2014]. It consists of two
hidden layers besides the input and output layers. The input includes two
parts: 1) the local m + n - 1 context vector used in [Devlin et al.,
2014] (n - 1 target history words and m source-side context words), where
each word is mapped to a 192-dimensional vector; 2) the global
192-dimensional sentence semantic representation vector obtained with the
learnt chunk-based CNN.

Figure 4: Local translation prediction with both the local context and
the global sentence semantic representations.

After two 512-dimensional hidden layers with the rectified linear
activation function (Relu, $\sigma(x) = \max(0, x)$), we apply the
softmax function in the output layer to calculate the probability of each
target word in the vocabulary. Following [Devlin et al., 2014], the input
vocabulary contains 16,000 source words and 16,000 target words. The
output vocabulary contains 32,000 target words.

Footnote 2: Word alignment information in the translation rules is
retained during decoding.

This feed-forward neural network is trained to maximize the
log-likelihood over the target side of the bilingual training data for
machine translation.
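The input layout and forward pass described in this section can be illustrated with a short NumPy sketch. It is a toy under stated assumptions: a single shared embedding table, a padding id of 0 for window positions outside the sentence, random untrained weights, and tiny vocabularies instead of the 16,000 + 16,000 input words and 32,000 output words mentioned in the text. The names `source_window` and `predict` are hypothetical.

```python
import numpy as np

EMB, HID = 192, 512    # embedding and hidden sizes stated in the text
N, M = 4, 11           # target n-gram order and source window size

def source_window(src_ids, a_i, m=M, pad_id=0):
    """Return the m source word ids centred at position a_i, padded at the edges."""
    half = (m - 1) // 2
    return [src_ids[j] if 0 <= j < len(src_ids) else pad_id
            for j in range(a_i - half, a_i + half + 1)]

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    z = z - z.max()            # subtract the maximum for numerical stability
    p = np.exp(z)
    return p / p.sum()

def predict(context_ids, sent_vec, params):
    """Two Relu hidden layers and a softmax over the output vocabulary."""
    E, W1, b1, W2, b2, Wo, bo = params
    local = E[context_ids].reshape(-1)        # (m + n - 1) * EMB values
    x = np.concatenate([local, sent_vec])     # append the global sentence vector
    h1 = relu(W1 @ x + b1)
    h2 = relu(W2 @ h1 + b2)
    return softmax(Wo @ h2 + bo)

rng = np.random.default_rng(0)
V_IN, V_OUT = 50, 20                          # toy vocabulary sizes
in_dim = (M + N - 1) * EMB + EMB
params = (
    rng.normal(scale=0.1, size=(V_IN, EMB)),
    rng.normal(scale=0.01, size=(HID, in_dim)), np.zeros(HID),
    rng.normal(scale=0.01, size=(HID, HID)), np.zeros(HID),
    rng.normal(scale=0.01, size=(V_OUT, HID)), np.zeros(V_OUT),
)
history = [1, 2, 3]                           # n - 1 target history word ids
context = history + source_window([7, 8, 9], a_i=1)
probs = predict(context, rng.normal(size=EMB), params)
print(in_dim, probs.shape)                    # 2880 (20,)
```

With n = 4 and m = 11 the local context contributes 14 x 192 values and the sentence vector another 192, giving the 2,880-dimensional input printed above.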





































3.2 Decoding with Neural Network Probability 

To calculate the target word conditional probability with the neural
network, we need the n - 1 target history words, the m source context
words and the whole source sentence, all of which are easy to obtain
during SMT decoding. Thus, the neural network probability can be
integrated into any SMT model. In this paper, the hierarchical
phrase-based model (HPB) [Chiang, 2007] is employed.

The HPB model translates source sentences using synchronous context-free
grammars (SCFG). The SCFG translation rules are in the form of
$X \rightarrow \langle \alpha, \gamma, \sim \rangle$, where $X$ is the
non-terminal symbol, $\alpha$ and $\gamma$ are sequences of lexical items
and non-terminals on the source and target side respectively, and
$\sim$ indicates the one-to-one correspondence between non-terminals in
$\alpha$ and $\gamma$ in the standard version. Here, in order to easily
retrieve the central source word aligned to each target word during
decoding, we also retain the alignments between source- and target-side
lexical terms in the rules. That is, $\sim$ contains all the
correspondences between terminals or non-terminals in $\alpha$ and
$\gamma$.
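A rule that keeps both non-terminal correspondences and terminal alignments could be represented as below. This is a hypothetical data structure for illustration only; the class name `Rule`, its fields and the placeholder tokens are assumptions, not part of any decoder described in the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    source: tuple       # source-side symbols, e.g. ("X1", "s_a", "s_b")
    target: tuple       # target-side symbols, e.g. ("t_b", "X1", "t_a")
    links: frozenset    # (source_index, target_index) pairs for terminals and non-terminals

    def aligned_source(self, target_index):
        """Source positions linked to one target position (empty if unaligned)."""
        return sorted(s for s, t in self.links if t == target_index)

rule = Rule(
    source=("X1", "s_a", "s_b"),
    target=("t_b", "X1", "t_a"),
    links=frozenset({(0, 1), (1, 2), (2, 0)}),
)
print(rule.aligned_source(0))   # [2]
print(rule.aligned_source(2))   # [1]
```

Looking up `aligned_source` for a target terminal gives the position around which the m-word source window would be centred during decoding.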

The HPB SMT system adopts a log-linear model to search for the best
translation candidate. The features in the log-linear model include:
1) forward and backward rule probabilities, 2) forward and backward
lexical probabilities, 3) a 5-gram language model, 4) rule counts and a
translation length penalty, and 5) a glue rule reordering model. The
conditional probability
$p(t_i \mid t_{i-n+1}, \cdots, t_{i-1}, S_i, s)$ calculated in the
previous section serves as the sixth kind of informative feature to be
integrated in the log-linear model.
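How an additional feature enters a log-linear model can be shown with a toy scoring function. The feature names, values and weights below are made up for illustration; in a real system the weights would be tuned on the development set and the feature values would come from the decoder.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values (typically log-probabilities or counts)."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical weights and feature values for two competing hypotheses.
weights = {"lm": 1.0, "rule_fwd": 1.0, "nn_global": 0.5}
hypotheses = {
    "a": {"lm": -4.0, "rule_fwd": -2.0, "nn_global": -1.0},
    "b": {"lm": -3.5, "rule_fwd": -2.0, "nn_global": -3.0},
}
scores = {name: loglinear_score(feats, weights) for name, feats in hypotheses.items()}
best = max(scores, key=scores.get)
print(scores, best)   # {'a': -6.5, 'b': -7.0} a
```

Hypothesis b has the better language model score, but the neural network feature tips the decision towards a, which is the kind of effect the sixth feature is meant to have.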


4 Experiments 

Before elaborating the experimental results, we first introduce 
the details of neural network training and the experimental 
settings. 


4.1 Neural Network Training Details 

For the bilingually-constrained chunk-based CNN, the initial
192-dimensional word embeddings are trained with word2vec [Mikolov et
al., 2013]; the embeddings for English words are learnt on ~1.1B data,
while those for Chinese words are learnt on ~0.7B data. We set the
context window h = 3 for convolution. We test multiple settings of the
chunk number (C = 1, 2, 4, 8) to see which one performs best. We apply
L = 100 filters. The two fully connected linear layers both contain 192
neurons. The dropout ratio in the dropout layer is set to 0.5 to prevent
overfitting. Dot-product is employed to calculate the distance between
source and target sentence representations in the shared semantic space.
Standard back-propagation and the stochastic gradient descent (SGD)
algorithm are used to optimize this network.

For the feed-forward neural network, we also apply the SGD algorithm. A
key issue with this neural network is that the computation in the softmax
layer is too time consuming, since normalization is required over the
entire large vocabulary. Inspired by [Vaswani et al., 2013], we adopt
Noise Contrastive Estimation (NCE) [Gutmann and Hyvarinen, 2010] to avoid
the normalization in the output layer.
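The paper does not spell out which NCE variant is used, so the following is only a sketch of the standard per-example NCE loss: the true word is classified against k words drawn from a noise distribution q, and the network scores are treated as unnormalized log-probabilities. The function names and the toy numbers are illustrative.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    return -np.logaddexp(0.0, -x)

def nce_loss(score_true, scores_noise, q_true, q_noise):
    """NCE loss for one training example with k = len(scores_noise) noise words.

    score_*: unnormalized log-probabilities from the network.
    q_*:     probabilities of the same words under the noise distribution.
    """
    k = len(scores_noise)
    delta_true = score_true - np.log(k * q_true)
    delta_noise = np.asarray(scores_noise) - np.log(k * np.asarray(q_noise))
    # log P(data | true word) + sum of log P(noise | noise word)
    return float(-(log_sigmoid(delta_true) + np.sum(log_sigmoid(-delta_noise))))

noise_scores, noise_q = [0.0, -1.0], [0.1, 0.2]
confident = nce_loss(5.0, noise_scores, 0.05, noise_q)
unsure = nce_loss(-5.0, noise_scores, 0.05, noise_q)
print(confident < unsure)   # True
```

Because only the true word and k sampled words are scored, no sum over the whole output vocabulary is needed during training.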


Data             Chinese Sent. Num.   English Sent. Num.
bilingual data   2,086,731            2,086,731
Xinhua News      -                    10,912,683
NIST03           919                  919x4
NIST05           1,082                1,082x4
NIST06           1,000                1,000x4
NIST08           691                  691x4

Table 1: Data statistics of the SMT experiment.


4.2 Experimental Setup 

The SMT evaluation is conducted on Chinese-to-English translation. The
bilingual training data from LDC (see footnote 3) contains about 2.1
million sentence pairs. This bilingual data is also used to train the two
neural networks. The 5-gram language model is trained on the English part
of the bilingual training data and the Xinhua portion of the English
Gigaword corpus. NIST MT03 is used as the tuning data. MT05, MT06 and
MT08 (news data) are used as the test data. Table 1 shows the detailed
data statistics. Case-insensitive BLEU is employed as the evaluation
metric. The statistical significance test is performed with the pairwise
re-sampling approach [Koehn, 2004].
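The pairwise re-sampling test can be sketched as a paired bootstrap: test sentences are drawn with replacement, both systems are scored on the same sample, and the fraction of samples on which one system wins is reported. The sketch below is an approximation under clear assumptions: it takes per-sentence statistics and a generic `metric` callable, and the demo uses a plain mean as a stand-in, whereas a real BLEU comparison would need per-sentence n-gram counts aggregated at corpus level.

```python
import numpy as np

def paired_bootstrap(stats_a, stats_b, metric, samples=1000, seed=0):
    """Fraction of resampled test sets on which system A scores higher than system B."""
    stats_a, stats_b = np.asarray(stats_a), np.asarray(stats_b)
    rng = np.random.default_rng(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(samples):
        idx = rng.integers(0, n, size=n)     # draw sentence indices with replacement
        if metric(stats_a[idx]) > metric(stats_b[idx]):
            wins += 1
    return wins / samples

# Toy per-sentence scores: system A is better on every sentence.
a = [2.0, 3.0, 4.0, 5.0]
b = [1.0, 2.0, 3.0, 4.0]
print(paired_bootstrap(a, b, np.mean, samples=200))   # 1.0
```

Under this scheme, a win fraction of at least 0.99 would roughly correspond to the p < 0.01 level used in the results table.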


4.3 Experimental Results 

To gain a comprehensive understanding of the capacity of our proposed
model, we compare our method with several baselines. The different
systems are as follows:

• HPB: the hierarchical phrase-based translation system.

• +NNJM: the HPB system incorporating the feed-forward neural network
joint model, in which the probability is predicted with 3 target history
words and 11 source-side local context words.

• +AVE_SENT: similar to +NNJM, but the neural network probability is
augmented with a source-side global sentence representation obtained by
averaging all the word embeddings in the sentence.

• +BCCNN: similar to +AVE_SENT, but instead of the average embedding, the
sentence semantic representation is learnt using the
bilingually-constrained chunk-based convolutional neural network.


Table 2 gives the detailed results. First, let us look at the performance
of the neural network joint model using the local contexts (+NNJM).
Compared to the hierarchical phrase-based model HPB, this model performs
significantly better on test sets MT05 and MT08. The biggest improvement
is 0.95 BLEU points (on the tuning set MT03). Devlin et al. [2014]
reported that this model can outperform HPB by more than 1.0 BLEU points
on Chinese-to-English translation. Although our improvement is somewhat
smaller, we demonstrate that it is helpful to apply the neural network
joint model using source- and target-side local contexts.


Footnote 3: LDC2000T50, LDC2002L27, LDC2002T01, LDC2003E07, LDC2003E14,
LDC2003T17, LDC2004T07, LDC2005T06, LDC2005T10 and LDC2005T34.




























System       MT03    MT05     MT06     MT08
HPB          35.98   34.66    35.25    27.80
+NNJM        36.93   35.55+   35.77    28.64+
+AVE_SENT    37.16   35.88+   36.07+   29.19+
+BCCNN-1     37.32   36.06+   36.42+   29.35+*
+BCCNN-2     37.75   36.24+   36.65+*  29.97+*
+BCCNN-4     37.98   36.22+   36.78+*  30.02+*
+BCCNN-8     37.64   36.29+*  36.49+   29.98+*

Table 2: Experimental results of different translation systems. The
significance test is performed on the test sets. "+" means that the model
significantly outperforms the baseline HPB with p < 0.01. "*" indicates
that the model is significantly better than +NNJM with p < 0.01.
"+BCCNN-4" denotes that this model adopts 4 chunks in the pooling layer.


When the model +NNJM is augmented with the global sentence
representations, we obtain further gains (see the last 5 lines in Table
2). Specifically, if the sentence representation is generated by just
averaging all the word embeddings, it yields slight improvements over the
model +NNJM. However, +AVE_SENT does not perform significantly better
than +NNJM. These results indicate that the sentence representation is
beneficial to translation quality, but because the averaged
representation lacks adequate semantics, it does not lead to large
improvements.

If the sentence representation is learnt by the bilingually-constrained
chunk-based CNN (BCCNN), the models achieve larger BLEU improvements no
matter how many chunks we adopt. Note that using only one chunk is
equivalent to max-over-time pooling. From Table 2 we see that using more
chunks performs better than max-over-time pooling. The results indicate
that partitioning the sentence into several chunks and summarizing the
important semantics of each chunk results in better sentence semantic
representations.

Overall, the model using 4 chunks (+BCCNN-4) performs best. It obtains
the best score on three out of four sets, and it significantly
outperforms the model +NNJM on test sets MT06 and MT08. The largest gain
over +NNJM is 1.38 BLEU points on MT08. Interestingly, the model with 8
chunks performs only similarly to that with 2 chunks. We speculate that
too many chunks may introduce some noise, but this deserves deeper
investigation. Nevertheless, we can say that the global sentence semantic
representation substantially benefits target translation prediction.
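The gains quoted in this section follow directly from the BLEU scores in Table 2, as the short check below shows. The dictionary simply copies three rows of the table; the helper name `gains` is illustrative.

```python
# BLEU scores copied from Table 2.
bleu = {
    "HPB":      {"MT03": 35.98, "MT05": 34.66, "MT06": 35.25, "MT08": 27.80},
    "+NNJM":    {"MT03": 36.93, "MT05": 35.55, "MT06": 35.77, "MT08": 28.64},
    "+BCCNN-4": {"MT03": 37.98, "MT05": 36.22, "MT06": 36.78, "MT08": 30.02},
}

def gains(system, baseline):
    """Per-set BLEU difference between two systems, rounded to two decimals."""
    return {s: round(bleu[system][s] - bleu[baseline][s], 2) for s in bleu[system]}

print(gains("+NNJM", "HPB"))
# {'MT03': 0.95, 'MT05': 0.89, 'MT06': 0.52, 'MT08': 0.84}
print(gains("+BCCNN-4", "+NNJM"))
# {'MT03': 1.05, 'MT05': 0.67, 'MT06': 1.01, 'MT08': 1.38}
```

The 0.95 gain of +NNJM over HPB is on the tuning set MT03, and the 1.38 gain of +BCCNN-4 over +NNJM is on MT08.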

5 Related Work 

Our work mainly addresses two key issues: one is learning the sentence
semantic representation, and the other is applying global sentence-level
information to better predict target translations. We discuss the related
work from these two aspects.


On sentence representation learning, many researchers per- [...] -code
from the source sentence representation, and it takes a long time to
train their models. Socher et al. [2011; 2013a; 2013b] designed recursive
neural networks for syntactic parsing and sentiment analysis, in which
the sentence representation is a by-product. Le and Mikolov [2014] used a
simple feed-forward neural network to learn sentence and paragraph
representations, but one disadvantage is that the representation of a
test sentence must be learnt by performing the training process. Kim
[2014], Kalchbrenner et al. [2014], Hu et al. [2014] and Zeng et al.
[2014] adopted convolutional neural networks to learn sentence
representations for different classification tasks.

The sentence representations learnt with the above methods are mainly
task dependent. For example, they are sensitive to the sentiment
(relation or structure) of the sentence [Socher et al., 2013b; Kim, 2014;
Zeng et al., 2014]. These sentence representations do not encode adequate
semantics of the sentence. In contrast, we aim at encoding as much
semantics as possible in the sentence representation by designing the
bilingually-constrained chunk-based convolutional neural network.

On applying more information for translation prediction, Devlin et al.
[2014] developed a neural network joint model to make full use of the
source- and target-side local contexts. However, they ignored the global
sentence-level information. Recently, some researchers have exploited
information beyond the sentence level. For example, Eidelman et al.
[2012], Su et al. [2012] and Zhang et al. [2014b] attempted to apply the
topic information of the document to distinguish good translation rules
from bad ones during SMT decoding. However, their methods require the
document information of the sentence, which is often unavailable in
practice. Instead, we focus only on sentence-level features. We have
designed two neural networks: one for sentence semantic representation
learning, and the other for target word probability prediction.


6 Conclusions and Future Work 

In this paper, we have explored source-side global sentence
representations for target translation prediction. We presented a new
bilingually-constrained chunk-based convolutional neural network to learn
sentence semantic representations. In order to integrate the sentence
representation into the SMT model, we further applied a feed-forward
neural network joint model to better predict translation probability with
both local and global information. Extensive experiments have shown that
the proposed model can significantly outperform the strong hierarchical
phrase-based translation model enriched with broader context features.

At present, the number of chunks in the chunk-based CNN is determined
manually. We plan to analyse the sentence structure and design a
principled way to choose the number of chunks. We would also like to
apply our sentence semantic representations to other tasks such as
question answering and paraphrase detection.